Attach libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights13)
library(here)
## here() starts at C:/Users/jethu/Documents/R Studio - Stat 383/Lecture Code/DS241Portfolioz
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(ggplot2)

View the flights data frame

flights

Assign the flights data frame to variable df1

df1 <- flights 
df1

Task 1: Flights in September from Miami

df2 <- df1 |>
  filter((month == 9) & (origin == "MIA"))
df2

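There are no flights from Miami in this dataset, because it only records flights that departed from the three New York City airports (EWR, JFK, and LGA).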
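One way to confirm this is to list the origin airports that actually occur in the data. The sketch below assumes the dplyr and nycflights13 packages are installed; the name origin_counts is introduced here only for illustration.

```r
library(dplyr)
library(nycflights13)

# Count departures per origin airport, largest first.
origin_counts <- flights |>
  count(origin, sort = TRUE)
origin_counts

# Check directly whether MIA ever appears as an origin.
"MIA" %in% flights$origin
```

If MIA is missing from origin_counts, any filter on origin == "MIA" will return zero rows.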

Task 2: Flights in September going to Miami

df3 <- df1 |>
  filter(month == 9 & dest == "MIA")
df3

There were 912 flights going to Miami from a New York City airport in September 2013.

Task 4: Flights in January going to Miami

df4 <- df1 |>
  filter(month == 1 & dest == "MIA")
df4

There were 981 flights going to Miami in the month of January.

Task 5: Flights in summer going to Chicago

Two major Chicago airports appear in this dataset as destination airports. Their names and IATA codes are listed below. This information was taken from https://www.cleartrip.com/tourism/airports/chicago-airport.html.

O'Hare International Airport: ORD

Chicago Midway International Airport: MDW

Since the task does not define “summer” precisely, one option is to use the months most widely associated with summer: June, July, and August.

df5 <- df1 |>
  filter(
    (month == 6 | month == 7 | month == 8) &
      (dest == "ORD" | dest == "MDW")
  )
df5

There were 5,770 flights to either O'Hare or Midway airport in the summer months of June, July, and August.
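The chained == and | comparisons in the filter above can also be written as membership tests with %in%. The following is a minimal sketch on a small made-up tibble (the rows are hypothetical, not taken from nycflights13) that checks both forms select the same rows.

```r
library(dplyr)

# Made-up rows for illustration only.
toy <- tibble(
  month = c(5, 6, 7, 8, 8, 9),
  dest  = c("ORD", "ORD", "MIA", "MDW", "ORD", "MDW")
)

# Chained comparisons, in the same style as the filter above.
chained <- toy |>
  filter((month == 6 | month == 7 | month == 8) & (dest == "ORD" | dest == "MDW"))

# Membership tests with %in%; comma-separated conditions are combined with AND.
compact <- toy |>
  filter(month %in% 6:8, dest %in% c("ORD", "MDW"))

# Compare the two results.
all.equal(chained, compact)
```

The %in% form is easier to extend, for example if a third destination code were added to the list.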

We can make our search more specific by defining our range as the first and last days of summer, which were June 21, 2013 and September 21, 2013 respectively. These dates were taken from https://www.calendardate.com/year2013.php.

df5a <- df1 |>
  filter((time_hour >= "2013-06-21") & (time_hour <= "2013-09-22")) |>
  filter((dest == "ORD") | (dest == "MDW"))
df5a

There were 5,843 flights to either O'Hare or Midway airport from June 21, 2013 to September 21, 2013.
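Comparing time_hour with a bare string relies on R converting the string to a date-time, which may use the time zone of the current R session. A more explicit variant builds the boundaries with as.POSIXct() and a named time zone. The sketch below uses made-up timestamps, assumes the "America/New_York" zone that time_hour is stored in, and introduces the hypothetical names summer_start and summer_end. It also treats the end boundary as exclusive, which is a design choice and may give a slightly different count than <=.

```r
library(dplyr)

# Made-up scheduled hours for illustration, in New York time.
toy <- tibble(
  time_hour = as.POSIXct(
    c("2013-06-20 23:00", "2013-06-21 06:00",
      "2013-09-21 22:00", "2013-09-22 05:00"),
    tz = "America/New_York"
  ),
  dest = c("ORD", "ORD", "MDW", "ORD")
)

# Explicit boundaries: start of June 21 (inclusive), start of September 22 (exclusive).
summer_start <- as.POSIXct("2013-06-21", tz = "America/New_York")
summer_end   <- as.POSIXct("2013-09-22", tz = "America/New_York")

# Keep rows from June 21 through the end of September 21.
toy_summer <- toy |>
  filter(time_hour >= summer_start, time_hour < summer_end)
toy_summer
```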

Task 6: Delays associated with flight number 86

Here we create a variable “flight_nums” that contains only our target data, the flights to Miami in September. We then reduce it to the unique flight numbers and store the result back in “flight_nums”.

flight_nums <- df1 |> filter((month == 9) & (dest == "MIA"))
flight_nums <- unique(flight_nums$flight)

Then, we create a new variable “df6” that contains only the flights with the minimum flight number. In this case the minimum is 86, which we get by applying the min() function to “flight_nums”. We also restrict the destination to Miami.

df6 <- df1 |>
  filter(flight == min(flight_nums), dest == "MIA")
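The same steps can be traced on a few made-up rows to see why the destination condition matters: a flight number can be reused on other routes, so filtering on the number alone could pull in unrelated flights. This is only a sketch; the toy values and the names toy, toy_nums, and toy_df6 are hypothetical.

```r
library(dplyr)

# Made-up rows: flight 86 appears on the MIA route and on one other route.
toy <- tibble(
  month  = c(9, 9, 9, 1, 9),
  dest   = c("MIA", "MIA", "MIA", "MIA", "ORD"),
  flight = c(1022, 86, 86, 86, 86)
)

# Unique flight numbers among September flights to MIA.
toy_nums <- toy |>
  filter(month == 9, dest == "MIA") |>
  pull(flight) |>
  unique()

# All flights (any month) with the smallest such number, restricted to MIA.
toy_df6 <- toy |>
  filter(flight == min(toy_nums), dest == "MIA")
toy_df6
```

Note that the final filter is not limited to September, so flights with that number in other months are kept as well.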

Finally, we take the data in “df6” and create a visualization of departure delays in relation to arrival delays.

df6 |>
  ggplot(aes(x = dep_delay, y = arr_delay)) +
  geom_point() +
  labs(y = "Arrival delay (min)", x = "Departure delay (min)")

This gives us a scatter plot of flight number 86’s departure delays against its arrival delays.

The documentation of the “nycflights13” package states that negative departure delays indicate early departures and negative arrival delays indicate early arrivals. This information was taken from https://cran.r-project.org/web/packages/nycflights13/nycflights13.pdf. It can also be viewed in the help window by running the code below and looking at the dep_delay, arr_delay entry.

?flights
## starting httpd help server ... done
ggplot(df6, aes(x = dep_delay, y = arr_delay)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Departure Delay (min)", y = "Arrival Delay (min)")
## `geom_smooth()` using formula = 'y ~ x'

This visualization adds a line of best fit, which shows a positive correlation between departure delay and arrival delay. In general, the longer the departure delay in minutes, the longer the arrival delay. This makes intuitive sense: if you leave later than planned, you will arrive later than planned, because the travel time of the flight is largely unaffected by the departure time. Whether you leave at 1:00 pm or 1:40 pm, the duration of the flight stays relatively consistent.
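The strength of this relationship can also be expressed numerically with a correlation coefficient and the slope of the fitted line. The sketch below uses made-up delay values rather than the actual “df6” data, so the numbers it produces say nothing about flight 86; the names toy, delay_cor, and slope are hypothetical.

```r
# Made-up delays in minutes, including one missing pair.
toy <- data.frame(
  dep_delay = c(-5, 0, 10, 30, 60, NA),
  arr_delay = c(-12, -3, 4, 28, 51, NA)
)

# Pearson correlation, ignoring rows with missing values.
delay_cor <- cor(toy$dep_delay, toy$arr_delay, use = "complete.obs")

# Least-squares line; lm() drops rows with missing values by default.
fit <- lm(arr_delay ~ dep_delay, data = toy)
slope <- coef(fit)[["dep_delay"]]

delay_cor
slope
```

A slope near 1 would mean each extra minute of departure delay adds about one minute of arrival delay, while a slope below 1 would suggest that some of the delay is recovered in the air.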

Tasking at home: discover something more about delay

Does the time of day at which the delay occurs affect the overall flying time? Is flight time affected by a delayed departure? (Do the airlines try to “catch up”?) Does the departure delay change across the time of day? (Perhaps there are more delays later in the day.) Is the flight time pattern affected by time of year? Is departure delay affected by time of year?

A second visualization

df1 |> 
  filter(dest == "MIA") |>
  count(origin, sort=TRUE)

Is flight time affected by delayed departure?

I want to examine whether the flight time is impacted by delayed departure.

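To compare actual flight time with “planned” flight time, we create a new variable:

flt_delta = arr_delay - dep_delay

Because both delays are measured against the schedule, their difference is the actual trip duration minus the scheduled duration. A flight that departs on time and arrives 10 minutes late has a “delta” of 10 minutes.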
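A few made-up flights show how the sign of the delta reads. This is a minimal sketch with hypothetical values: a positive delta means the trip took longer than scheduled, a negative delta means time was made up, and zero means the delay was carried through unchanged.

```r
library(dplyr)

# Made-up delays in minutes.
toy <- tibble(
  dep_delay = c(0, 30, 15),
  arr_delay = c(10, 20, 15)
)

# Delta between arrival delay and departure delay.
toy_delta <- toy |>
  mutate(flt_delta = arr_delay - dep_delay)
toy_delta
# flt_delta is 10, -10, 0 for the three rows
```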

df7 <- df1 |>
  filter(dest == "MIA", origin == "LGA") |>
  mutate(flt_delta = arr_delay - dep_delay)
df7 |>
  ggplot(aes(x = dep_delay, y = flt_delta)) +
  geom_point(alpha = .1)
## Warning: Removed 79 rows containing missing values (`geom_point()`).

df7 |> 
  ggplot(aes(x = time_hour, y = dep_delay)) +
  geom_point(alpha = .1) +
  stat_smooth() +
  ylim(-25, 120)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 168 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 168 rows containing missing values (`geom_point()`).

Why are delays bigger in December than in January? Probably not because of weather.

Does the departure delay change across time of day?

df7 |> 
  ggplot(aes(x = hour + minute / 60, y = dep_delay)) +
  geom_point(alpha = .1) +
  stat_smooth() +
  ylim(-25, 120)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 168 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 168 rows containing missing values (`geom_point()`).

Observation: Departure delay increases over the course of the flight day.
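The trend in the plot can be backed up with a per-hour summary table. The sketch below runs on made-up rows and uses the hypothetical names toy and by_hour; the same group_by() and summarise() steps could be applied to “df7”.

```r
library(dplyr)

# Made-up scheduled departure hours and delays in minutes.
toy <- tibble(
  hour      = c(6, 6, 12, 12, 18, 18),
  dep_delay = c(-3, 1, 4, NA, 20, 30)
)

# Number of flights and mean departure delay for each scheduled hour.
by_hour <- toy |>
  group_by(hour) |>
  summarise(
    n_flights      = n(),
    mean_dep_delay = mean(dep_delay, na.rm = TRUE)
  )
by_hour
```

Here na.rm = TRUE skips flights with a missing delay, which in the real data are typically cancelled flights.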

df7 |>
  mutate(day_of_week = weekdays(time_hour)) |>
  ggplot(aes(x = hour + minute / 60, y = dep_delay, color = day_of_week)) +
  geom_point(alpha = .1) +
  stat_smooth() +
  ylim(-25, 120)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 168 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 168 rows containing missing values (`geom_point()`).

df7 |>
  mutate(day_of_week = weekdays(time_hour)) |>
  ggplot(aes(x = hour + minute / 60, y = dep_delay, color = day_of_week)) +
  geom_point(alpha = .1) +
  stat_smooth() +
  ylim(-20, 40) +
  facet_wrap(~day_of_week)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 478 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 478 rows containing missing values (`geom_point()`).